Our current survey provider is SurveyMonkey. The blackstone package
contains several functions that make reading SurveyMonkey data into R
more manageable and create a codebook for the data along the
way.
SurveyMonkey exports data with two header rows, which does not work
in R, where tibbles and data frames can have only one row
of names.
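To see why the two header rows matter, here is a minimal base R sketch of one way to collapse them into a single row of names. The toy export, the " - " separator, and the forward-fill of blank first-row cells are assumptions made for illustration; this is not a description of how blackstone is implemented.

```r
# Toy SurveyMonkey-style export (made up for illustration):
# two header rows followed by the responses.
raw <- c(
  "Respondent ID,How confident are you?,",
  ",Reading papers,Writing papers",
  "1,Very confident,Slightly confident",
  "2,Somewhat confident,Very confident"
)

# Read a single CSV line into a plain character vector.
read_header <- function(line) {
  unname(unlist(read.csv(text = line, header = FALSE, colClasses = "character")))
}

header_1 <- read_header(raw[1])
header_2 <- read_header(raw[2])

# Carry a question forward over the blank cells of its matrix items.
for (i in seq_along(header_1)[-1]) {
  if (header_1[i] == "") header_1[i] <- header_1[i - 1]
}

# Combine the two rows; columns with no second header keep the first.
combined <- ifelse(header_2 == "", header_1, paste(header_1, header_2, sep = " - "))

# Read the responses (everything after the two header rows) and name them.
dat <- read.csv(text = raw[-(1:2)], header = FALSE)
names(dat) <- combined
names(dat)
```

Once the two rows are merged into one, the data fits R's single-row naming model, and the long combined names can be shortened through a codebook.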
Here is how to import data from SurveyMonkey using example data
provided with blackstone. The example is a fake dataset from a pre
(baseline) survey. There are three steps to this process:
# File path for pre example data:
pre_data_fp <- blackstone::blackstoneExample("sm_data_pre.csv")
# 1. Create the codebook:
codebook_pre <- blackstone::createCodebook(pre_data_fp)
codebook_pre
## # A tibble: 32 × 5
## header_1 header_2 combined_header position variable_name
## <chr> <chr> <chr> <int> <chr>
## 1 Respondent ID <NA> Respondent ID 1 respondent_id
## 2 Collector ID <NA> Collector ID 2 collector_id
## 3 Start Date <NA> Start Date 3 start_date
## 4 End Date <NA> End Date 4 end_date
## 5 IP Address <NA> IP Address 5 ip_address
## # ℹ 27 more rows
In this codebook, the first column, header_1, is the
first header row from the SurveyMonkey data; the second column,
header_2, is the second header row; the third column,
combined_header, is the combination of the two headers;
position is the column number of each
combined_header; and variable_name is a
cleaned-up version of combined_header.
variable_name is the column to edit: its values become the column
names of the imported SurveyMonkey data, so change them to shorter and
more meaningful names.
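As a rough illustration of what a "cleaned up" header looks like, the sketch below converts header text to snake_case with base R. The helper `clean_header()` and the question wording are hypothetical; the cleaning rules blackstone actually applies may differ.

```r
# Hypothetical cleaner: lower-case the text, replace every run of
# non-alphanumeric characters with "_", then trim stray underscores.
clean_header <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

cleaned <- clean_header(c(
  "Respondent ID",
  "IP Address",
  "Are you a first-generation college student?"  # assumed wording
))
cleaned
```

Names produced this way are valid R names but can be very long for full question text, which is why the next step shortens them by hand.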
# 2. Edit the codebook:
# Set up sequential naming conventions for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (look at the codebook and match header_1 and header_2 to variable_name to find the rows to change)
research_items <- codebook_pre[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_pre[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 5 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 25 to 29 named `variable_name`:
ethics_items <- codebook_pre[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# Edit the `variable_name` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_pre <- codebook_pre %>% dplyr::mutate(
variable_name = dplyr::case_match(
variable_name, # column to match
'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
'with_which_gender_do_you_most_closely_identify' ~ "gender",
'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
'are_you_a_first_generation_college_student' ~ "first_gen",
names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
names(ability_names) ~ ability_names[variable_name], # Same for `ability_names`
names(ethics_names) ~ ethics_names[variable_name], # Same for `ethics_names`
.default = variable_name # returns default value from original `variable_name` if not changed.
)
)
codebook_pre
## # A tibble: 32 × 5
## header_1 header_2 combined_header position variable_name
## <chr> <chr> <chr> <int> <chr>
## 1 Respondent ID <NA> Respondent ID 1 respondent_id
## 2 Collector ID <NA> Collector ID 2 collector_id
## 3 Start Date <NA> Start Date 3 start_date
## 4 End Date <NA> End Date 4 end_date
## 5 IP Address <NA> IP Address 5 ip_address
## # ℹ 27 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_pre, file = "{filepath-to-codebook}")
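The matrix-question renaming in step 2 relies on a named character vector acting as a lookup table from old names to new names. Here is a minimal base R sketch of that idea; the values in `old` are made-up stand-ins, not the real cleaned headers.

```r
# Made-up cleaned names: one ID column and two matrix items.
old <- c("respondent_id", "confidence_reading_papers", "confidence_writing_papers")

# Build the lookup: names are the old names, values are the new ones.
matrix_items <- old[2:3]
new_names <- setNames(paste0("research_", seq_along(matrix_items)), matrix_items)

# Indexing by name returns the new value, or NA when there is no entry.
looked_up <- unname(new_names[old])

# Keep the original name wherever the lookup found nothing.
renamed <- ifelse(is.na(looked_up), old, looked_up)
renamed
```

The fallback to the original name plays the same role as `.default = variable_name` in the case_match() call.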
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
pre_data <- blackstone::readRenameData(pre_data_fp, codebook = codebook_pre)
pre_data
## # A tibble: 100 × 32
## respondent_id collector_id start_date end_date ip_address email_address
## <dbl> <dbl> <date> <date> <chr> <chr>
## 1 114628000001 431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2 114628000002 431822954 2024-06-21 2024-06-22 110.241.132.50 mstamm@hermi…
## 3 114628000003 431822954 2024-06-14 2024-06-15 165.58.112.64 precious.fei…
## 4 114628000004 431822954 2024-06-15 2024-06-16 49.34.121.147 ines52@gmail…
## 5 114628000005 431822954 2024-06-15 2024-06-16 115.233.66.80 franz44@hotm…
## # ℹ 95 more rows
## # ℹ 26 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## # knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
## # research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
## # research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
## # ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
## # ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …
The SurveyMonkey example data is now imported with names taken from
the codebook column variable_name:
names(pre_data)
## [1] "respondent_id" "collector_id" "start_date" "end_date"
## [5] "ip_address" "email_address" "first_name" "last_name"
## [9] "unique_id" "knowledge" "research_1" "research_2"
## [13] "research_3" "research_4" "research_5" "research_6"
## [17] "research_7" "research_8" "ability_1" "ability_2"
## [21] "ability_3" "ability_4" "ability_5" "ability_6"
## [25] "ethics_1" "ethics_2" "ethics_3" "ethics_4"
## [29] "ethics_5" "gender" "ethnicity" "first_gen"
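Conceptually, renaming from a codebook is a match between the data's current headers and the codebook rows. The sketch below shows that idea in base R with a two-column toy codebook; it is an assumption about the general technique, not a description of what readRenameData() does internally.

```r
# Toy codebook with the two columns needed for renaming.
codebook <- data.frame(
  combined_header = c("Respondent ID", "Custom Data 1"),
  variable_name   = c("respondent_id", "unique_id"),
  stringsAsFactors = FALSE
)

# Toy data that still carries the original headers.
dat <- data.frame(
  `Respondent ID` = 1:2,
  `Custom Data 1` = c(10, 20),
  check.names = FALSE
)

# Find each data column in the codebook and fail loudly if one is missing.
idx <- match(names(dat), codebook$combined_header)
stopifnot(!anyNA(idx))

names(dat) <- codebook$variable_name[idx]
names(dat)
```

Matching by header text rather than by position guards against a codebook whose rows were reordered while editing.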
# File path for post example data:
post_data_fp <- blackstone::blackstoneExample("sm_data_post.csv")
# 1. Create the codebook:
codebook_post <- blackstone::createCodebook(post_data_fp)
codebook_post
## # A tibble: 37 × 5
## header_1 header_2 combined_header position variable_name
## <chr> <chr> <chr> <int> <chr>
## 1 Respondent ID <NA> Respondent ID 1 respondent_id
## 2 Collector ID <NA> Collector ID 2 collector_id
## 3 Start Date <NA> Start Date 3 start_date
## 4 End Date <NA> End Date 4 end_date
## 5 IP Address <NA> IP Address 5 ip_address
## # ℹ 32 more rows
# 2. Edit the codebook:
# Set up sequential naming conventions for matrix-style questions with shared likert scale response options:
# 8 items that are matrix-style likert scales- turned into a scale called `research`- here is how to name them all at once:
# Rows 11 to 18 belong to the "research" matrix question (look at the codebook and match header_1 and header_2 to variable_name to find the rows to change)
research_items <- codebook_post[["variable_name"]][11:18]
research_names <- paste0("research_", seq_along(research_items)) %>% purrr::set_names(., research_items) # Create a new named vector of names for these columns
# 6 items that are matrix-style likert scales- turned into a scale called `ability`- Rows 19 to 24 named `variable_name`:
ability_items <- codebook_post[["variable_name"]][19:24]
ability_names <- paste0("ability_", seq_along(ability_items)) %>% purrr::set_names(., ability_items) # Create a new named vector of names for these columns
# 5 items that are matrix-style likert scales- turned into a scale called `ethics`- Rows 25 to 29 named `variable_name`:
ethics_items <- codebook_post[["variable_name"]][25:29]
ethics_names <- paste0("ethics_", seq_along(ethics_items)) %>% purrr::set_names(., ethics_items) # Create a new named vector of names for these columns
# 5 items that are open-ended follow-ups shown when the corresponding ethics items were answered "Strongly disagree" or "Disagree"- Rows 30 to 34 named `variable_name`:
ethics_items_oe <- codebook_post[["variable_name"]][30:34]
ethics_names_oe <- paste0("ethics_", seq_along(ethics_items_oe), "_oe") %>% purrr::set_names(., ethics_items_oe) # Create a new named vector of names for these columns
# Edit the `variable_name` column: Use dplyr::mutate() and dplyr::case_match() to change the column `variable_name`:
codebook_post <- codebook_post %>% dplyr::mutate(
variable_name = dplyr::case_match(
variable_name, # column to match
'custom_data_1' ~ "unique_id", # changes 'custom_data_1' to "unique_id"
'to_what_extent_are_you_knowledgeable_in_conducting_research_in_your_field_of_study' ~ "knowledge",
'with_which_gender_do_you_most_closely_identify' ~ "gender",
'which_race_ethnicity_best_describes_you_please_choose_only_one' ~ "ethnicity",
'are_you_a_first_generation_college_student' ~ "first_gen",
names(research_names) ~ research_names[variable_name], # takes the above named vector and when the name matches, applies new value in that position as replacement.
names(ability_names) ~ ability_names[variable_name], # Same for `ability_names`
names(ethics_names) ~ ethics_names[variable_name], # Same for `ethics_names`
names(ethics_names_oe) ~ ethics_names_oe[variable_name], # Same for `ethics_names_oe`
.default = variable_name # returns default value from original `variable_name` if not changed.
)
)
codebook_post
## # A tibble: 37 × 5
## header_1 header_2 combined_header position variable_name
## <chr> <chr> <chr> <int> <chr>
## 1 Respondent ID <NA> Respondent ID 1 respondent_id
## 2 Collector ID <NA> Collector ID 2 collector_id
## 3 Start Date <NA> Start Date 3 start_date
## 4 End Date <NA> End Date 4 end_date
## 5 IP Address <NA> IP Address 5 ip_address
## # ℹ 32 more rows
# Write out the edited codebook to save for future use-
# Be sure to double check questions match new names before writing out:
# readr::write_csv(codebook_post, file = "{filepath-to-codebook}")
# 3. Read in the data and rename the vars using readRenameData(), passing the file path and the edited codebook:
post_data <- blackstone::readRenameData(post_data_fp, codebook = codebook_post)
post_data
## # A tibble: 100 × 37
## respondent_id collector_id start_date end_date ip_address email_address
## <dbl> <dbl> <date> <date> <chr> <chr>
## 1 114628000001 431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2 114628000002 431822954 2024-06-21 2024-06-22 110.241.132.50 mstamm@hermi…
## 3 114628000003 431822954 2024-06-14 2024-06-15 165.58.112.64 precious.fei…
## 4 114628000004 431822954 2024-06-15 2024-06-16 49.34.121.147 ines52@gmail…
## 5 114628000005 431822954 2024-06-15 2024-06-16 115.233.66.80 franz44@hotm…
## # ℹ 95 more rows
## # ℹ 31 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## # knowledge <chr>, research_1 <chr>, research_2 <chr>, research_3 <chr>,
## # research_4 <chr>, research_5 <chr>, research_6 <chr>, research_7 <chr>,
## # research_8 <chr>, ability_1 <chr>, ability_2 <chr>, ability_3 <chr>,
## # ability_4 <chr>, ability_5 <chr>, ability_6 <chr>, ethics_1 <chr>,
## # ethics_2 <chr>, ethics_3 <chr>, ethics_4 <chr>, ethics_5 <chr>, …
Add pre_ and post_ prefixes to the variables that differ between the two surveys (the survey items) before merging; the SurveyMonkey metadata columns and the demographics are identical in both datasets:
# Pre data:
pre_data <- pre_data %>% rename_with(~ paste0("pre_", .), .cols = c(knowledge:ethics_5))
# Post data:
post_data <- post_data %>% rename_with(~ paste0("post_", .), .cols = c(knowledge:ethics_5_oe))
# left_join() automatically joins by all shared columns; to silence the message it prints, name them explicitly with the `by = join_by()` argument:
sm_data <- pre_data %>% dplyr::left_join(post_data, by = join_by(respondent_id, collector_id, start_date, end_date, ip_address, email_address,
first_name, last_name, unique_id, gender, ethnicity, first_gen))
sm_data
## # A tibble: 100 × 57
## respondent_id collector_id start_date end_date ip_address email_address
## <dbl> <dbl> <date> <date> <chr> <chr>
## 1 114628000001 431822954 2024-06-05 2024-06-06 227.224.138.113 coraima59@me…
## 2 114628000002 431822954 2024-06-21 2024-06-22 110.241.132.50 mstamm@hermi…
## 3 114628000003 431822954 2024-06-14 2024-06-15 165.58.112.64 precious.fei…
## 4 114628000004 431822954 2024-06-15 2024-06-16 49.34.121.147 ines52@gmail…
## 5 114628000005 431822954 2024-06-15 2024-06-16 115.233.66.80 franz44@hotm…
## # ℹ 95 more rows
## # ℹ 51 more variables: first_name <chr>, last_name <chr>, unique_id <dbl>,
## # pre_knowledge <chr>, pre_research_1 <chr>, pre_research_2 <chr>,
## # pre_research_3 <chr>, pre_research_4 <chr>, pre_research_5 <chr>,
## # pre_research_6 <chr>, pre_research_7 <chr>, pre_research_8 <chr>,
## # pre_ability_1 <chr>, pre_ability_2 <chr>, pre_ability_3 <chr>,
## # pre_ability_4 <chr>, pre_ability_5 <chr>, pre_ability_6 <chr>, …
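The prefix-then-join pattern can be seen on a toy example. This base R sketch uses merge() in place of dplyr::left_join(), and the three-row data frames are made up for illustration.

```r
# Toy pre and post data sharing an ID and a demographic column.
pre  <- data.frame(unique_id = 1:3,
                   gender = c("Female", "Male", "Female"),
                   knowledge = c("A little knowledgeable", "Somewhat knowledgeable", "Very knowledgeable"))
post <- data.frame(unique_id = 1:3,
                   gender = c("Female", "Male", "Female"),
                   knowledge = c("Very knowledgeable", "Very knowledgeable", "Extremely knowledgeable"))

# Prefix only the survey items so they stay distinct after the merge.
items <- "knowledge"
names(pre)[names(pre) %in% items]   <- paste0("pre_", items)
names(post)[names(post) %in% items] <- paste0("post_", items)

# Keep every pre row (all.x = TRUE), matching on the shared columns.
merged <- merge(pre, post, by = c("unique_id", "gender"), all.x = TRUE)
names(merged)
```

Every column listed in the join key must hold identical values in both datasets for a respondent to match, so key columns that can legitimately differ between survey waves should be left out of the key.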
## Knowledge scale
levels_knowledge <- c("Not knowledgeable at all", "A little knowledgeable", "Somewhat knowledgeable", "Very knowledgeable", "Extremely knowledgeable")
## Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
## Ability Items scale:
levels_min_ext <- c("Minimal", "Slight", "Moderate", "Good", "Extensive")
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")
# Demographic levels:
gender_levels <- c("Female","Male","Non-binary", "Do not wish to specify")
ethnicity_levels <- c("White (Non-Hispanic/Latino)", "Asian", "Black", "Hispanic or Latino", "American Indian or Alaskan Native",
"Native Hawaiian or other Pacific Islander", "Do not wish to specify")
first_gen_levels <- c("Yes", "No", "I'm not sure")
# Use mutate() to convert each item in each scale to a factor using the vectors above. across() applies a function to the items selected with contains(),
# or items can be selected individually by name with a character vector, e.g. contains("_knowledge") or c("pre_knowledge", "post_knowledge").
# Also create new numeric variables for all the likert scale items, using the suffix '_num' to denote numeric:
sm_data <- sm_data %>% dplyr::mutate(dplyr::across(tidyselect::contains("_knowledge"), ~ factor(., levels = levels_knowledge)), # match each name pattern to select to each factor level
dplyr::across(tidyselect::contains("_knowledge"), as.numeric, .names = "{.col}_num"), # create new numeric items for all knowledge items
dplyr::across(tidyselect::contains("research_"), ~ factor(., levels = levels_confidence)),
dplyr::across(tidyselect::contains("research_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all research items
dplyr::across(tidyselect::contains("ability_"), ~ factor(., levels = levels_min_ext)),
dplyr::across(tidyselect::contains("ability_"), as.numeric, .names = "{.col}_num"), # create new numeric items for all ability items
# select ethics items but not the open_ended responses:
dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), ~ factor(., levels = levels_agree5)),
dplyr::across(tidyselect::contains("ethics_") & !tidyselect::contains("_oe"), as.numeric, .names = "{.col}_num"), # new numeric items for all ethics items
# individually convert all demographics to factor variables:
gender = factor(gender, levels = gender_levels),
ethnicity = factor(ethnicity, levels = ethnicity_levels),
first_gen = factor(first_gen, levels = first_gen_levels),
)
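The factor-then-numeric conversion works because as.numeric() on a factor returns the position of each value in `levels`. A small base R sketch with made-up responses shows this, including what happens to a response that does not match any level exactly:

```r
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree",
                   "Agree", "Strongly agree")

# Made-up responses; "agree" (lower case) is deliberately not a valid level.
responses <- c("Agree", "Strongly disagree", "agree", NA)

f <- factor(responses, levels = levels_agree5)
scores <- as.numeric(f)
scores
```

Any response that is not spelled exactly like one of the levels silently becomes NA, so it is worth comparing the count of missing values before and after the conversion.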
The most common task is creating frequency tables of counts and
percentages for likert scale items; blackstone provides
likertTable() for that:
# Research items pre and post frequency table, with counts and percentages: use levels_confidence character vector
# use likertTable to return frequency table, passing the scale_labels: (can also label the individual questions using the arg question_label)
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>%
blackstone::likertTable(., scale_labels = levels_confidence)
Question | Not at all confident | Slightly confident | Somewhat confident | Very confident | Extremely confident | n |
|---|---|---|---|---|---|---|
pre_research_1 | 10 (10%) | 8 (8%) | 41 (41%) | 27 (27%) | 14 (14%) | 100 |
pre_research_2 | 56 (56%) | 19 (19%) | 8 (8%) | 9 (9%) | 8 (8%) | 100 |
pre_research_3 | 32 (32%) | 23 (23%) | 15 (15%) | 20 (20%) | 10 (10%) | 100 |
pre_research_4 | 9 (9%) | 24 (24%) | 32 (32%) | 24 (24%) | 11 (11%) | 100 |
pre_research_5 | 40 (40%) | 18 (18%) | 21 (21%) | 13 (13%) | 8 (8%) | 100 |
pre_research_6 | 17 (17%) | 25 (25%) | 25 (25%) | 16 (16%) | 17 (17%) | 100 |
pre_research_7 | 59 (59%) | 13 (13%) | 11 (11%) | 10 (10%) | 7 (7%) | 100 |
pre_research_8 | 21 (21%) | 19 (19%) | 23 (23%) | 19 (19%) | 18 (18%) | 100 |
post_research_1 | 2 (2%) | 26 (26%) | 23 (23%) | 25 (25%) | 24 (24%) | 100 |
post_research_2 | 12 (12%) | 14 (14%) | 12 (12%) | 14 (14%) | 48 (48%) | 100 |
post_research_3 | 8 (8%) | 22 (22%) | 21 (21%) | 23 (23%) | 26 (26%) | 100 |
post_research_4 | 2 (2%) | 24 (24%) | 25 (25%) | 19 (19%) | 30 (30%) | 100 |
post_research_5 | 14 (14%) | 19 (19%) | 19 (19%) | 19 (19%) | 29 (29%) | 100 |
post_research_6 | 5 (5%) | 3 (3%) | 24 (24%) | 28 (28%) | 40 (40%) | 100 |
post_research_7 | 11 (11%) | 10 (10%) | 14 (14%) | 17 (17%) | 48 (48%) | 100 |
post_research_8 | 4 (4%) | 7 (7%) | 23 (23%) | 23 (23%) | 43 (43%) | 100 |
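Each cell in a table like this is a count with its percentage of the item's total. The base R sketch below builds those cells for a single made-up item; it illustrates the calculation only and is not the code behind likertTable().

```r
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident",
                       "Very confident", "Extremely confident")

# Four made-up responses to one item.
item <- factor(c("Very confident", "Very confident", "Slightly confident", "Extremely confident"),
               levels = levels_confidence)

# table() on a factor counts every level, including those with zero responses.
counts <- as.integer(table(item))
pct <- 100 * counts / sum(counts)

# Format as "count (percent%)", one cell per scale level.
cells <- setNames(sprintf("%d (%.0f%%)", counts, pct), levels(item))
cells
```

Converting to a factor first is what keeps unused response options in the table as "0 (0%)" instead of dropping them.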
# Another way to make a list of many freq_tables to print out with other data analysis later on,
# using pmap() to do multiple likertTable() at once:
# Set up tibbles of each set of scales that contain all pre and post data:
# research:
research_df <- sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor))
# knowledge:
knowledge_df <- sm_data %>% dplyr::select(tidyselect::contains("_knowledge") & !tidyselect::contains("_num") & where(is.factor))
# ability:
ability_df <- sm_data %>% dplyr::select(tidyselect::contains("ability_") & !tidyselect::contains("_num") & where(is.factor))
# ethics:
ethics_df <- sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_oe") & !tidyselect::contains("_num") & where(is.factor))
# set up tibble with the columns as the args to pass to likertTable(), each row of the column `df` is the tibble of items and
# each row of `scale_labels` is the vector of likert scale labels:
freq_params <- tribble(
~df, ~scale_labels, # name of columns
knowledge_df, levels_knowledge,
research_df, levels_confidence,
ability_df, levels_min_ext,
ethics_df, levels_agree5
)
# Create a named list of frequency tables
freq_tables <- freq_params %>% purrr::pmap(blackstone::likertTable) %>%
purrr::set_names(., c("Knowledge Items", "Research Items", "Ability Items", "Ethics Items"))
# can select the list by position or by name:
# freq_tables[[1]]
freq_tables[["Knowledge Items"]]
Question | Not knowledgeable at all | A little knowledgeable | Somewhat knowledgeable | Very knowledgeable | Extremely knowledgeable | n |
|---|---|---|---|---|---|---|
pre_knowledge | 65 (65%) | 11 (11%) | 9 (9%) | 9 (9%) | 6 (6%) | 100 |
post_knowledge | 7 (7%) | 14 (14%) | 9 (9%) | 24 (24%) | 46 (46%) | 100 |
freq_tables[["Research Items"]]
Question | Not at all confident | Slightly confident | Somewhat confident | Very confident | Extremely confident | n |
|---|---|---|---|---|---|---|
pre_research_1 | 10 (10%) | 8 (8%) | 41 (41%) | 27 (27%) | 14 (14%) | 100 |
pre_research_2 | 56 (56%) | 19 (19%) | 8 (8%) | 9 (9%) | 8 (8%) | 100 |
pre_research_3 | 32 (32%) | 23 (23%) | 15 (15%) | 20 (20%) | 10 (10%) | 100 |
pre_research_4 | 9 (9%) | 24 (24%) | 32 (32%) | 24 (24%) | 11 (11%) | 100 |
pre_research_5 | 40 (40%) | 18 (18%) | 21 (21%) | 13 (13%) | 8 (8%) | 100 |
pre_research_6 | 17 (17%) | 25 (25%) | 25 (25%) | 16 (16%) | 17 (17%) | 100 |
pre_research_7 | 59 (59%) | 13 (13%) | 11 (11%) | 10 (10%) | 7 (7%) | 100 |
pre_research_8 | 21 (21%) | 19 (19%) | 23 (23%) | 19 (19%) | 18 (18%) | 100 |
post_research_1 | 2 (2%) | 26 (26%) | 23 (23%) | 25 (25%) | 24 (24%) | 100 |
post_research_2 | 12 (12%) | 14 (14%) | 12 (12%) | 14 (14%) | 48 (48%) | 100 |
post_research_3 | 8 (8%) | 22 (22%) | 21 (21%) | 23 (23%) | 26 (26%) | 100 |
post_research_4 | 2 (2%) | 24 (24%) | 25 (25%) | 19 (19%) | 30 (30%) | 100 |
post_research_5 | 14 (14%) | 19 (19%) | 19 (19%) | 19 (19%) | 29 (29%) | 100 |
post_research_6 | 5 (5%) | 3 (3%) | 24 (24%) | 28 (28%) | 40 (40%) | 100 |
post_research_7 | 11 (11%) | 10 (10%) | 14 (14%) | 17 (17%) | 48 (48%) | 100 |
post_research_8 | 4 (4%) | 7 (7%) | 23 (23%) | 23 (23%) | 43 (43%) | 100 |
freq_tables[["Ability Items"]]
Question | Minimal | Slight | Moderate | Good | Extensive | n |
|---|---|---|---|---|---|---|
pre_ability_1 | 9 (9%) | 28 (28%) | 37 (37%) | 15 (15%) | 11 (11%) | 100 |
pre_ability_2 | 24 (24%) | 14 (14%) | 27 (27%) | 22 (22%) | 13 (13%) | 100 |
pre_ability_3 | 26 (26%) | 30 (30%) | 19 (19%) | 19 (19%) | 6 (6%) | 100 |
pre_ability_4 | 43 (43%) | 27 (27%) | 13 (13%) | 4 (4%) | 13 (13%) | 100 |
pre_ability_5 | 32 (32%) | 18 (18%) | 26 (26%) | 17 (17%) | 7 (7%) | 100 |
pre_ability_6 | 26 (26%) | 12 (12%) | 18 (18%) | 26 (26%) | 18 (18%) | 100 |
post_ability_1 | 3 (3%) | 20 (20%) | 23 (23%) | 20 (20%) | 34 (34%) | 100 |
post_ability_2 | 2 (2%) | 11 (11%) | 11 (11%) | 29 (29%) | 47 (47%) | 100 |
post_ability_3 | 11 (11%) | 19 (19%) | 22 (22%) | 15 (15%) | 33 (33%) | 100 |
post_ability_4 | 9 (9%) | 6 (6%) | 9 (9%) | 23 (23%) | 53 (53%) | 100 |
post_ability_5 | 12 (12%) | 19 (19%) | 16 (16%) | 27 (27%) | 26 (26%) | 100 |
post_ability_6 | 6 (6%) | 11 (11%) | 29 (29%) | 29 (29%) | 25 (25%) | 100 |
freq_tables[["Ethics Items"]]
Question | Strongly disagree | Disagree | Neither agree nor disagree | Agree | Strongly agree | n |
|---|---|---|---|---|---|---|
pre_ethics_1 | 10 (10%) | 16 (16%) | 45 (45%) | 20 (20%) | 9 (9%) | 100 |
pre_ethics_2 | 20 (20%) | 15 (15%) | 27 (27%) | 23 (23%) | 15 (15%) | 100 |
pre_ethics_3 | 28 (28%) | 16 (16%) | 30 (30%) | 12 (12%) | 14 (14%) | 100 |
pre_ethics_4 | 50 (50%) | 17 (17%) | 12 (12%) | 6 (6%) | 15 (15%) | 100 |
pre_ethics_5 | 14 (14%) | 16 (16%) | 31 (31%) | 30 (30%) | 9 (9%) | 100 |
post_ethics_1 | 4 (4%) | 18 (18%) | 24 (24%) | 30 (30%) | 24 (24%) | 100 |
post_ethics_2 | 7 (7%) | 10 (10%) | 13 (13%) | 31 (31%) | 39 (39%) | 100 |
post_ethics_3 | 7 (7%) | 26 (26%) | 14 (14%) | 28 (28%) | 25 (25%) | 100 |
post_ethics_4 | 10 (10%) | 7 (7%) | 6 (6%) | 19 (19%) | 58 (58%) | 100 |
post_ethics_5 | 3 (3%) | 14 (14%) | 25 (25%) | 31 (31%) | 27 (27%) | 100 |
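The pmap() call used to build `freq_tables` runs one function call per row of parameters, matching column names to argument names. For readers less familiar with purrr, here is the same pattern sketched with base R's Map(); the helper `count_levels()` and the toy data are hypothetical.

```r
# One list element per "row": a data frame and its scale labels.
params <- list(
  df = list(data.frame(q1 = c("a", "b")),
            data.frame(q1 = c("a", "a", "b"))),
  scale_labels = list(c("a", "b"), c("a", "b"))
)

# Hypothetical stand-in for a table function: count each label in column q1.
count_levels <- function(df, scale_labels) {
  vapply(scale_labels, function(l) sum(df$q1 == l), integer(1))
}

# Map() walks the lists in parallel, passing arguments by name.
tables <- Map(count_levels, df = params$df, scale_labels = params$scale_labels)
names(tables) <- c("First", "Second")
tables[["Second"]]
```

Naming the resulting list, as done with set_names() in the vignette code, makes each table retrievable by a readable label later on.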
blackstone also contains a function, groupedTable(), that creates
frequency tables for demographics; these can also be grouped by a
variable like role or cohort.
# Set up labels for variables
# Labels for questions column of table, pass to question_labels argument:
demos_labels <- c('Gender' = "gender",
'Race/Ethnicity' = "ethnicity",
'First-Generation College Student' = "first_gen")
sm_data %>% dplyr::select(gender, ethnicity, first_gen) %>% # select the demographic vars
blackstone::groupedTable(question_labels = demos_labels) # pass the new labels for the 'Question' column.
| Question | Response | n = 100¹ |
|---|---|---|
| Gender | | |
| | Female | 47 (47%) |
| | Male | 50 (50%) |
| | Non-binary | 2 (2%) |
| | Do not wish to specify | 1 (1%) |
| Race/Ethnicity | | |
| | White (Non-Hispanic/Latino) | 36 (36%) |
| | Asian | 23 (23%) |
| | Black | 7 (7%) |
| | Hispanic or Latino | 18 (18%) |
| | American Indian or Alaskan Native | 5 (5%) |
| | Native Hawaiian or other Pacific Islander | 7 (7%) |
| | Do not wish to specify | 4 (4%) |
| First-Generation College Student | | |
| | Yes | 59 (59%) |
| | No | 39 (39%) |
| | I'm not sure | 2 (2%) |

¹ n (%)
sm_data %>% dplyr::select(tidyselect::contains("_knowledge") & tidyselect::contains("_num")) %>% # select knowledge pre and post numeric items
dplyr::mutate(knowledge_diff = post_knowledge_num - pre_knowledge_num) %>% # get difference of pre and post scores
rstatix::shapiro_test(knowledge_diff)
## # A tibble: 1 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 knowledge_diff 0.885 0.000000301
The Shapiro-Wilk test is significant (p < 0.05), so the knowledge difference scores are not normally distributed. Use a Wilcoxon signed-rank test.
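The decision rule used here (test the difference scores for normality, then pick the paired test) can be written as a small helper. This is a sketch of the rule of thumb only: the function name `choose_paired_test()` is hypothetical, and the two input vectors are artificial data built to be clearly normal and clearly skewed.

```r
# Return the name of the paired test suggested by a Shapiro-Wilk test
# on the pre/post difference scores.
choose_paired_test <- function(diff_scores, alpha = 0.05) {
  p <- shapiro.test(diff_scores)$p.value
  if (p < alpha) "wilcoxon" else "t-test"
}

# Artificial data: evenly spaced normal quantiles, and a
# strongly right-skewed transformation of them.
symmetric <- qnorm(ppoints(50))
skewed <- exp(2 * symmetric)

choose_paired_test(symmetric)
choose_paired_test(skewed)
```

A formal normality test is one input among several; with large samples it can flag trivial departures, so a histogram or Q-Q plot of the differences is a useful companion check.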
# Either use the pipe-friendly wilcox_test() from `rstatix`, which requires converting to long form with `timing` as a variable:
knowledge_wilcoxon <- sm_data %>% dplyr::select(tidyselect::contains("_knowledge") & tidyselect::contains("_num")) %>%
tidyr::pivot_longer(tidyselect::contains(c("pre_", "post_")), names_to = "question", values_to = "response") %>%
tidyr::separate(.data$question, into = c("timing", "question"), sep = "_", extra = "merge") %>%
rstatix::wilcox_test(response ~ timing, paired = TRUE, detailed = TRUE)
# Or use the simple base R wilcox.test with each pre and post item:
wilcox.test(sm_data[["post_knowledge_num"]], sm_data[["pre_knowledge_num"]], paired = TRUE)
##
## Wilcoxon signed rank test with continuity correction
##
## data: sm_data[["post_knowledge_num"]] and sm_data[["pre_knowledge_num"]]
## V = 3800, p-value = 0.0000000000001
## alternative hypothesis: true location shift is not equal to 0
The Wilcoxon test is significant: there is a significant difference between pre and post knowledge scores.
sm_data <- sm_data %>% dplyr::rowwise() %>% # Get the mean for each individual by row
dplyr::mutate(pre_research_mean = mean(dplyr::c_across(tidyselect::contains("pre_research_") & tidyselect::contains("_num"))), # pre mean for each individual
post_research_mean = mean(dplyr::c_across(tidyselect::contains("post_research_") & tidyselect::contains("_num"))), # post mean for each individual
diff_research = post_research_mean - pre_research_mean # get difference scores of pre and post means.
) %>% dplyr::ungroup()
sm_data %>% rstatix::shapiro_test(diff_research) # not significant
## # A tibble: 1 × 3
## variable statistic p
## <chr> <dbl> <dbl>
## 1 diff_research 0.991 0.720
The Shapiro-Wilk test is not significant (p > 0.05), so the differences in the research composite scores are consistent with a normal distribution. Use a paired t-test.
# Either use the pipe-friendly t_test() from `rstatix`, which requires converting to long form with `timing` as a variable:
research_t_test <- sm_data %>% dplyr::select(pre_research_mean, post_research_mean) %>% # select the pre and post means for research items
tidyr::pivot_longer(tidyselect::contains(c("pre_", "post_")), names_to = "question", values_to = "response") %>%
tidyr::separate(.data[["question"]], into = c("timing", "question"), sep = "_", extra = "merge") %>%
rstatix::t_test(response ~ timing, paired = TRUE, detailed = TRUE)
research_t_test
## # A tibble: 1 × 13
## estimate .y. group1 group2 n1 n2 statistic p df conf.low
## * <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1.02 response post pre 100 100 15.6 2.43e-28 99 0.890
## # ℹ 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
# Or use the simple base R t.test with the pre and post composite means:
t.test(sm_data[["post_research_mean"]], sm_data[["pre_research_mean"]], paired = TRUE)
##
## Paired t-test
##
## data: sm_data[["post_research_mean"]] and sm_data[["pre_research_mean"]]
## t = 16, df = 99, p-value <0.0000000000000002
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## 0.8899 1.1501
## sample estimates:
## mean difference
## 1.02
The t-test is significant: the mean difference between pre and post scores is 1.02.
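The reported estimate is the mean of the per-person differences because a paired t-test is a one-sample t-test on those differences. A short base R check with made-up scores for six people illustrates the equivalence:

```r
# Made-up composite means for six respondents.
pre  <- c(2.1, 3.0, 2.5, 3.4, 2.8, 3.1)
post <- c(3.2, 3.9, 3.1, 4.6, 3.5, 4.4)

paired     <- t.test(post, pre, paired = TRUE)
one_sample <- t.test(post - pre)

# Same statistic and p-value either way.
all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(paired$p.value, one_sample$p.value)

# The estimate is the average individual change.
unname(paired$estimate)
mean(post - pre)
```

This is also why the normality check is run on the difference scores and not on the pre and post scores separately.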
blackstone has functions that create 3 types of charts
for data visualization: stacked bar charts, diverging stacked bar
charts, and arrow charts.
The functions for stacked bar charts and diverging stacked bar charts can use two different color palettes: a blue sequential palette or a blue-red diverging color palette.
The blue sequential palette should be used for all likert scales that have one clear direction, such as: Not at all confident, Slightly confident, Somewhat confident, Very confident, Extremely confident.
The blue-red diverging color palette should be used if the items have a likert scale that is folded or runs from a negative to a positive valence, such as: Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree.
The next three sections show examples of how to use these functions.
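To make the difference between the two palette types concrete, the sketch below builds a five-step sequential and a five-step diverging palette with base R's colorRampPalette(). The hex colors, and the choice of red for the negative end, are illustrative assumptions and not the colors blackstone uses.

```r
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident",
                       "Very confident", "Extremely confident")
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree",
                   "Agree", "Strongly agree")

# Sequential: one hue, light to dark, for scales with a single direction.
seq_palette <- setNames(
  grDevices::colorRampPalette(c("#DEEBF7", "#08519C"))(5),
  levels_confidence
)

# Diverging: two hues meeting at a neutral middle, for negative-to-positive scales.
div_palette <- setNames(
  grDevices::colorRampPalette(c("#B2182B", "#F7F7F7", "#2166AC"))(5),
  levels_agree5
)

seq_palette
div_palette
```

In a sequential palette color intensity tracks "more of the same thing", while a diverging palette gives the midpoint its own neutral color so that the two opposing sides are easy to tell apart.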
The most common visual used in reporting at Blackstone
Research and Evaluation is a stacked bar chart, and blackstone
has a function that makes creating these charts fast and easy:
stackedBarChart().
stackedBarChart() takes in a tibble of factor/character
variables to turn into a stacked bar chart. The other requirement is a
character vector of scale labels for the likert scale that makes up the
items in the tibble (the same one used to set them up as factors in
the data cleaning section).
By default, stackedBarChart() uses the blue sequential
palette to color the bars and sorts the items so that those with the
highest post counts/percentages come first.
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
# select variables and pass them to `stackedBarChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE)
# Select variables and pass them to `stackedBarChart()` along with scale_labels, change the arguments `percent_label` and `overall_n` both to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE, percent_label = FALSE, overall_n = FALSE)
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")
# select variables and pass them to `stackedBarChart()` along with scale_labels,
# change `fill_colors` to "div" to use the blue-red diverging color palette:
sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_num") & # select the factor variables for the ethics items
!tidyselect::contains("_oe") & where(is.factor)) %>%
blackstone::stackedBarChart(., scale_labels = levels_agree5, pre_post = TRUE, fill_colors = "div")
# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the
# names are the new question labels and the old names are the values without pre or post prefixes:
# Here I will use paste0 to create the names of the 8 research items as they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# select variables and pass them to `stackedBarChart()` along with scale_labels, also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::stackedBarChart(., scale_labels = levels_confidence, pre_post = TRUE, question_labels = research_question_labels, question_order = TRUE)
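The c("new label" = "original variable name") structure works as a reverse lookup: find the variable among the values, then take the matching name. The base R sketch below shows how prefixed column names could be mapped to their labels; it illustrates the naming structure only and is an assumption about, not a copy of, what stackedBarChart() does internally.

```r
# Two of the research labels, in the same structure as above.
question_labels <- c("Identify a scientific problem" = "research_2",
                     "Conduct quantitative data analysis" = "research_5")

# Column names as they appear in the merged data.
vars <- c("pre_research_2", "post_research_2", "pre_research_5", "post_research_5")

# Drop the timing prefix, then look each stem up among the values.
stem <- sub("^(pre|post)_", "", vars)
labels <- names(question_labels)[match(stem, question_labels)]
labels
```

Because the lookup uses the name without its prefix, one label serves both the pre and the post version of an item.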
# Question labels as a named vector with the naming structure
# like this: c("new label" = "original variable name"), where the
# names are the new question labels and the original variable names are the values (here with the `post_` prefix, since only post items are plotted):
# Here I will use paste0 to create the names of the 8 post research items:
research_question_labels <- paste0("post_research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# select variables and pass them to `stackedBarChart()` along with scale_labels, set pre_post to FALSE (default),
# also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("post_research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::stackedBarChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)
Another common visual used in reporting at Blackstone
Research and Evaluation is a diverging stacked bar chart, which I will
refer to from now on as a diverging bar chart. blackstone
has a function to make this type of chart:
divBarChart().
The diverging bar charts created using divBarChart()
diverge just after the mid-point of the Likert scale of the items
supplied to the function. See examples below.
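To make "just after the mid-point" concrete, here is one literal reading of that description in base R. It is only an illustration under that assumption: the actual split is determined inside divBarChart(), and the names `levels_example`, `left_side`, and `right_side` are hypothetical.

```r
# A 5-point scale like the ones used in this vignette.
levels_example <- c("Not at all confident", "Slightly confident", "Somewhat confident",
                    "Very confident", "Extremely confident")

# Assumption: the chart splits right after the middle level.
mid <- ceiling(length(levels_example) / 2)

left_side  <- levels_example[seq_len(mid)]                       # levels up to and including the mid-point
right_side <- levels_example[(mid + 1):length(levels_example)]   # levels after the mid-point

left_side
right_side
```

Under this reading, the three lowest levels fall on one side of the centre line and the two highest levels on the other.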
divBarChart() has all of the same arguments as
stackedBarChart(), so using it has the same
requirements.
divBarChart() uses the blue sequential
palette to color the bars and sorts the items by their
post counts/percentages, highest first.
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
# select variables and pass them to `divBarChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE)
# Select variables and pass them to `divBarChart()` along with scale_labels, change the arguments `percent_label` and `overall_n` both to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE, percent_label = FALSE, overall_n = FALSE)
## Ethics Items scale:
levels_agree5 <- c("Strongly disagree", "Disagree", "Neither agree nor disagree", "Agree", "Strongly agree")
# select variables and pass them to `divBarChart()` along with scale_labels,
# change `fill_colors` to "div" to use the blue-red diverging color palette:
sm_data %>% dplyr::select(tidyselect::contains("ethics_") & !tidyselect::contains("_num") & # select the factor variables for the ethics items
!tidyselect::contains("_oe") & where(is.factor)) %>%
blackstone::divBarChart(., scale_labels = levels_agree5, pre_post = TRUE, fill_colors = "div")
# Question labels are passed as a named vector structured like
# c("new label" = "original variable name"): the names are the new question labels
# and the values are the original variable names without pre or post prefixes.
# Use paste0() to build the 8 research item names as they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# select variables and pass them to `divBarChart()` along with scale_labels, also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::divBarChart(., scale_labels = levels_confidence, pre_post = TRUE, question_labels = research_question_labels, question_order = TRUE)
# Question labels are passed as a named vector structured like
# c("new label" = "original variable name"): the names are the new question labels
# and the values are the original variable names. Only post data is plotted here,
# so the values keep the post prefix.
# Use paste0() to build the 8 post research item names:
research_question_labels <- paste0("post_research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# select variables and pass them to `divBarChart()` along with scale_labels, set pre_post to FALSE (default),
# also pass research_question_labels to `question_labels` and set `question_order` to TRUE.
sm_data %>% dplyr::select(tidyselect::contains("post_research_") & !tidyselect::contains("_num") & where(is.factor)) %>% # select the factor variables for the research items
blackstone::divBarChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)
Arrow charts show the difference in means at two time points.
blackstone has two functions that create arrow charts:
arrowChart() and arrowChartGroup().
Both take a tibble of numeric pre-post data as the main input and also require a character vector of scale labels for the numeric scale that makes up the items in the tibble. The rest of the arguments for the two arrow chart functions are the same as those of the stacked bar chart functions.
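The quantity an arrow chart displays, a pre mean and a post mean per item, can be sketched in base R. This is not how the blackstone functions compute it internally; the data frame `toy_num` and its `pre_`/`post_` column names are hypothetical and chosen to resemble the numeric columns selected in this vignette.

```r
# Hypothetical numeric pre-post data (1 = lowest, 5 = highest on the scale).
toy_num <- data.frame(
  pre_research_1_num  = c(2, 3, 1, 4),
  post_research_1_num = c(4, 4, 3, 5),
  pre_research_2_num  = c(3, 3, 2, 2),
  post_research_2_num = c(3, 5, 4, 3)
)

# Mean of every column, then split into pre and post by name prefix.
item_means <- colMeans(toy_num, na.rm = TRUE)
pre_means  <- item_means[grepl("^pre_", names(item_means))]
post_means <- item_means[grepl("^post_", names(item_means))]

# One row per item: where the arrow starts, where it ends, and its length.
arrow_summary <- data.frame(
  item      = sub("^pre_", "", names(pre_means)),
  pre_mean  = unname(pre_means),
  post_mean = unname(post_means),
  change    = unname(post_means - pre_means),
  stringsAsFactors = FALSE
)
arrow_summary
```

Each row corresponds to one arrow: it starts at `pre_mean` and points to `post_mean`.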
arrowChart()
arrowChart() sorts the items/arrows by post average, from
highest to lowest, and draws the arrows in dark blue
(hex code #283251).
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
# select variables and pass them to `arrowChart()` along with scale_labels.
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChart(., scale_labels = levels_confidence)
# Select variables and pass them to `arrowChart()` along with scale_labels, change the argument `overall_n` to FALSE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChart(., scale_labels = levels_confidence, overall_n = FALSE)
# Question labels are passed as a named vector structured like
# c("new label" = "original variable name"): the names are the new question labels
# and the values are the original variable names without pre or post prefixes.
# Use paste0() to build the 8 research item names as they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# Select variables and pass them to `arrowChart()` along with scale_labels, and also pass research_question_labels to `question_labels` and set `question_order` to TRUE:
sm_data %>% dplyr::select(tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChart(., scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)
arrowChartGroup()
arrowChartGroup() allows the user to create an arrow
chart of pre-post averages grouped by a third variable, while also
showing the overall pre-post average as an arrow.
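The grouped averages plus the overall average described above can be sketched with base R's `aggregate()`. This is an illustration of the summary, not blackstone's internal code; `toy_grouped`, its column names, and the group values "Female" and "Male" are hypothetical.

```r
# Hypothetical grouped pre-post data for a single item.
toy_grouped <- data.frame(
  gender              = c("Female", "Female", "Male", "Male"),
  pre_research_1_num  = c(2, 4, 1, 3),
  post_research_1_num = c(4, 5, 3, 4),
  stringsAsFactors    = FALSE
)

# Pre and post means within each group.
group_means <- aggregate(
  cbind(pre_research_1_num, post_research_1_num) ~ gender,
  data = toy_grouped,
  FUN  = mean
)

# Pre and post means for the whole sample, labeled "Overall".
overall <- data.frame(
  gender              = "Overall",
  pre_research_1_num  = mean(toy_grouped$pre_research_1_num),
  post_research_1_num = mean(toy_grouped$post_research_1_num),
  stringsAsFactors    = FALSE
)

grouped_summary <- rbind(group_means, overall)
grouped_summary
```

Each row of `grouped_summary` corresponds to one arrow for the item: one per group and one for the whole sample.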
arrowChartGroup() sorts the items/arrows by post average,
from highest to lowest, and colors the arrows
using the Qualitative Color Palette, which has 11 distinct
colors.
arrowChartGroup() returns pre-post averages for each
group passed to group_levels as well as an “Overall” average for
the whole sample, which is always colored black. The order
of group_levels also determines the order of the
arrows and the legend.
# Research Items scale:
levels_confidence <- c("Not at all confident", "Slightly confident", "Somewhat confident", "Very confident", "Extremely confident")
# select variables and pass them to `arrowChartGroup()` along with scale_labels, the grouping variable in `group` and the levels for each group in `group_levels`:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence)
# Select variables and pass them to `arrowChartGroup()` along with scale_labels, change the argument `overall_n` to FALSE:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence, overall_n = FALSE)
# Question labels are passed as a named vector structured like
# c("new label" = "original variable name"): the names are the new question labels
# and the values are the original variable names without pre or post prefixes.
# Use paste0() to build the 8 research item names as they appear without prefixes:
research_question_labels <- paste0("research_", 1:8)
# Set new labels as names of `research_question_labels`
names(research_question_labels) <- c("Research relevant background literature", "Identify a scientific problem",
"Develop testable and realistic research questions", "Develop a falsifiable hypothesis",
"Conduct quantitative data analysis", "Design an experiment/Create a research design",
"Interpret findings and making recommendations", "Scientific or technical writing")
# Select variables and pass them to `arrowChartGroup()` along with scale_labels, and also pass research_question_labels to `question_labels` and set `question_order` to TRUE:
sm_data %>% dplyr::select(gender, tidyselect::contains("research_") & tidyselect::contains("_num") & where(is.numeric)) %>% # select the numeric variables for the research items
blackstone::arrowChartGroup(., group = "gender", group_levels = gender_levels, scale_labels = levels_confidence, question_labels = research_question_labels, question_order = TRUE)